Group membership failure detection: a simple protocol and its probabilistic analysis

Authors

  • Michel Raynal
  • Frédéric Tronel
Abstract

A group membership failure (in short, a group failure) occurs when one of the group members crashes. A group failure detection protocol has to inform all the non-crashed members of the group that this group entity has crashed. Ideally, such a protocol should be live (if a process crashes, then the group failure has to be detected) and safe (if a group failure is claimed, then at least one process has crashed). Unreliable asynchronous distributed systems are characterized by the impossibility for a process to get an accurate view of the system state. Consequently, the design of a group failure detection protocol that is both safe and live is a problem that cannot be solved in all runs of an asynchronous distributed system. This paper analyses a group failure detection protocol whose design naturally ensures its liveness. We show that by appropriately tuning some of its duration-related parameters, the safety property can be guaranteed with a probability as close to one as desired. This analysis shows that, in real distributed systems, it is possible to achieve failure detection with a negligible probability of wrong suspicions.
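
The abstract's claim that tuning duration-related parameters pushes the probability of safety toward one can be illustrated with a small calculation. The sketch below is not the protocol or the analysis from the paper. It assumes a simple timeout-based detector, message delays that follow an exponential distribution with a known mean, and independence between monitored members. The function names and the delay model are hypothetical and chosen only for illustration.

```python
import math


def wrong_suspicion_probability(timeout: float, mean_delay: float) -> float:
    """Probability that a message from a live process arrives after `timeout`.

    Assumption (not from the paper): delays are exponentially distributed
    with mean `mean_delay`, so P(delay > timeout) = exp(-timeout / mean_delay).
    """
    if timeout < 0 or mean_delay <= 0:
        raise ValueError("timeout must be >= 0 and mean_delay must be > 0")
    return math.exp(-timeout / mean_delay)


def timeout_for_target(mean_delay: float, epsilon: float) -> float:
    """Smallest timeout whose wrong-suspicion probability is at most `epsilon`."""
    if mean_delay <= 0 or not 0.0 < epsilon < 1.0:
        raise ValueError("mean_delay must be > 0 and epsilon must be in (0, 1)")
    return mean_delay * math.log(1.0 / epsilon)


def group_safety_probability(p_wrong: float, members: int) -> float:
    """Probability that none of `members` live processes is wrongly suspected.

    Assumption (not from the paper): wrong suspicions are independent.
    """
    if not 0.0 <= p_wrong <= 1.0 or members < 0:
        raise ValueError("p_wrong must be in [0, 1] and members must be >= 0")
    return (1.0 - p_wrong) ** members


if __name__ == "__main__":
    # Hypothetical numbers: 50 ms mean delay, target of one in a million.
    t = timeout_for_target(mean_delay=0.05, epsilon=1e-6)
    p = wrong_suspicion_probability(t, mean_delay=0.05)
    print(f"timeout: {t:.3f} s, wrong-suspicion probability: {p:.2e}")
    print(f"safety for a 10-member group: {group_safety_probability(p, 10):.6f}")
```

Under these assumptions, lengthening the timeout lowers the probability of a wrong suspicion exponentially, at the cost of slower detection of real crashes. This is the kind of trade-off the abstract refers to when it speaks of tuning duration-related parameters.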

Related resources

The Timewheel Group Membership Protocol

We describe a group membership protocol, called the timewheel group membership protocol, for a timed asynchronous distributed system. This protocol is a part of the timewheel group communication service that supports multiple group communication semantics simultaneously. The timewheel group membership protocol is unique in several respects. First, it has been designed for a timed asynchronous d...

Node Failure Detection and Membership in CANELy

Fault-tolerant distributed systems based on fieldbuses may benefit to a great extent from the availability of semantically rich communication services, such as those provided by group communication, clock synchronization, membership and failure detection. This is especially true of distributed critical control applications. However, the migration of those services to the realm of simple fieldbus...

A Probabilistically Correct Leader Election Protocol for Large Groups

This paper presents a scalable leader election protocol for large process groups with a weak membership requirement. The underlying network is assumed to be unreliable but characterized by probabilistic failure rates of processes and message deliveries. The protocol trades correctness for scale, that is, it provides very good probabilistic guarantees on correct termination in the sense of the c...

Reliable probabilistic communication in large-scale information dissemination systems

Reliable group communication is important for large-scale distributed applications such as information dissemination systems. The challenging issue in this context remains scalability. The computation time and amount of data dedicated to the reliability mechanism should remain manageable as the number of nodes in a system grows, and no bottleneck should emerge. Probabilistic algorithms have prov...

Improving the Quality of Service of Failure Detectors with SNMP and Artificial Neural Networks

A failure detector is an important building block for fault-tolerant distributed systems: mechanisms such as distributed consensus and group communication rely on the information provided by failure detectors in order to make progress and terminate. As such, erroneous information provided by the failure detection may delay decision-making or lead the upper-layer mechanism to take incorrect deci...

Journal title:
  • Distributed Systems Engineering

Volume 6  Issue 

Pages  -

Publication date 1999